Setup and Context¶

Introduction¶

On November 27, 1895, Alfred Nobel signed his last will in Paris. When it was opened after his death, the will caused a lot of controversy, as Nobel had left much of his wealth for the establishment of a prize.

Alfred Nobel dictates that his entire remaining estate should be used to endow “prizes to those who, during the preceding year, have conferred the greatest benefit to humankind”.

Every year the Nobel Prize is given to scientists and scholars in the categories chemistry, literature, physics, physiology or medicine, economics, and peace.

Let's see what patterns we can find in the data of the past Nobel laureates. What can we learn about the Nobel prize and our world more generally?

Upgrade plotly (only Google Colab Notebook)¶

Google Colab may not be running the latest version of plotly. If you're working in Google Colab, uncomment the line below, run the cell, and restart your notebook server.

Both ! and % allow you to run shell commands from a Jupyter notebook. See saved links in folder IPython magic commands ! $ % etc - run shell commands for difference.

!pip install --upgrade plotly

Import Statements¶

In [1]:
import pandas as pd
import numpy as np
import plotly.express as px
import seaborn as sns
import matplotlib.pyplot as plt

Notebook Presentation¶

In [2]:
pd.options.display.float_format = '{:,.2f}'.format

Read the Data¶

In [3]:
df_data = pd.read_csv('nobel_prize_data.csv')

Caveats: The exact birth dates for Michael Houghton, Venkatraman Ramakrishnan, and Nadia Murad are unknown. I've substituted them with mid-year estimate of July 2nd.

Data Exploration & Cleaning¶

Challenge: Preliminary data exploration.

  • What is the shape of df_data? How many rows and columns?
  • What are the column names?
  • In which year was the Nobel prize first awarded?
  • Which year is the latest year included in the dataset?
In [4]:
df_data.sample(3)
Out[4]:
year category prize motivation prize_share laureate_type full_name birth_date birth_city birth_country birth_country_current sex organization_name organization_city organization_country ISO
693 1998 Medicine The Nobel Prize in Physiology or Medicine 1998 "for their discoveries concerning nitric oxide... 1/3 Individual Robert F. Furchgott 1916-06-04 Charleston, SC United States of America United States of America Male SUNY Health Science Center New York, NY United States of America USA
501 1979 Peace The Nobel Peace Prize 1979 NaN 1/1 Organization Mother Teresa 1910-08-26 Uskup (Skopje) Ottoman Empire (Republic of Macedonia) Republic of Macedonia Female NaN NaN NaN MKD
828 2009 Physics The Nobel Prize in Physics 2009 "for the invention of an imaging semiconductor... 1/4 Individual Willard S. Boyle 1924-08-19 Amherst, NS Canada Canada Male Bell Laboratories Murray Hill, NJ United States of America CAN
In [5]:
df_data.shape
Out[5]:
(962, 16)

--> big dataset with 16 columns

In [6]:
df_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 962 entries, 0 to 961
Data columns (total 16 columns):
 #   Column                 Non-Null Count  Dtype 
---  ------                 --------------  ----- 
 0   year                   962 non-null    int64 
 1   category               962 non-null    object
 2   prize                  962 non-null    object
 3   motivation             874 non-null    object
 4   prize_share            962 non-null    object
 5   laureate_type          962 non-null    object
 6   full_name              962 non-null    object
 7   birth_date             934 non-null    object
 8   birth_city             931 non-null    object
 9   birth_country          934 non-null    object
 10  birth_country_current  934 non-null    object
 11  sex                    934 non-null    object
 12  organization_name      707 non-null    object
 13  organization_city      707 non-null    object
 14  organization_country   708 non-null    object
 15  ISO                    934 non-null    object
dtypes: int64(1), object(15)
memory usage: 120.4+ KB

We could potentially convert "year" and "birth_date" into datetime objects.

In [7]:
print(f"The first Nobel prize was awarded in {df_data.year.min()}")
The first Nobel prize was awarded in 1901
In [8]:
print(f"The latest year in our dataset is {df_data.year.max()}")
The latest year in our dataset is 2020

Challange:

  • Are there any duplicate values in the dataset?
  • Are there NaN values in the dataset?
  • Which columns tend to have NaN values?
  • How many NaN values are there per column?
  • Why do these columns have NaN values?

Check for Duplicates¶

In [9]:
df_data.duplicated().any()
Out[9]:
False

Check for NaN Values¶

In [10]:
df_data.isna().any().any()
Out[10]:
True
In [11]:
df_data.isna().any().loc[df_data.isna().any()].to_frame().transpose()
Out[11]:
motivation birth_date birth_city birth_country birth_country_current sex organization_name organization_city organization_country ISO
0 True True True True True True True True True True

These are the columns that host NaN values.

In [12]:
df_data.isna().sum().loc[df_data.isna().sum() != 0].to_frame().transpose()
Out[12]:
motivation birth_date birth_city birth_country birth_country_current sex organization_name organization_city organization_country ISO
0 88 28 31 28 28 28 255 255 254 28

We can see above the number of NaN values for each column that contains some.

A lot of laureates are no associated with an organization (orga. name, city, country).

The most important columns(year, category, prize, full_name) don't have NaN values so the dataset is good.

Let us filter on the NaN values in the birth date columns:

In [13]:
col_subset = ['year','category', 'laureate_type',
              'birth_date','full_name', 'organization_name']
df_data.loc[df_data.birth_date.isna()][col_subset]  # we can reuse a subset like this!
Out[13]:
year category laureate_type birth_date full_name organization_name
24 1904 Peace Organization NaN Institut de droit international (Institute of ... NaN
60 1910 Peace Organization NaN Bureau international permanent de la Paix (Per... NaN
89 1917 Peace Organization NaN Comité international de la Croix Rouge (Intern... NaN
200 1938 Peace Organization NaN Office international Nansen pour les Réfugiés ... NaN
215 1944 Peace Organization NaN Comité international de la Croix Rouge (Intern... NaN
237 1947 Peace Organization NaN American Friends Service Committee (The Quakers) NaN
238 1947 Peace Organization NaN Friends Service Council (The Quakers) NaN
283 1954 Peace Organization NaN Office of the United Nations High Commissioner... NaN
348 1963 Peace Organization NaN Comité international de la Croix Rouge (Intern... NaN
349 1963 Peace Organization NaN Ligue des Sociétés de la Croix-Rouge (League o... NaN
366 1965 Peace Organization NaN United Nations Children's Fund (UNICEF) NaN
399 1969 Peace Organization NaN International Labour Organization (I.L.O.) NaN
479 1977 Peace Organization NaN Amnesty International NaN
523 1981 Peace Organization NaN Office of the United Nations High Commissioner... NaN
558 1985 Peace Organization NaN International Physicians for the Prevention of... NaN
588 1988 Peace Organization NaN United Nations Peacekeeping Forces NaN
659 1995 Peace Organization NaN Pugwash Conferences on Science and World Affairs NaN
682 1997 Peace Organization NaN International Campaign to Ban Landmines (ICBL) NaN
703 1999 Peace Organization NaN Médecins Sans Frontières NaN
730 2001 Peace Organization NaN United Nations (U.N.) NaN
778 2005 Peace Organization NaN International Atomic Energy Agency (IAEA) NaN
788 2006 Peace Organization NaN Grameen Bank NaN
801 2007 Peace Organization NaN Intergovernmental Panel on Climate Change (IPCC) NaN
860 2012 Peace Organization NaN European Union (EU) NaN
873 2013 Peace Organization NaN Organisation for the Prohibition of Chemical W... NaN
897 2015 Peace Organization NaN National Dialogue Quartet NaN
919 2017 Peace Organization NaN International Campaign to Abolish Nuclear Weap... NaN
958 2020 Peace Organization NaN World Food Programme (WFP) NaN

We see that we get back a bunch of organisations, like the UN or the Red Cross. That makes sense. We also see that since the organisation's name is in the full_name column, the organisation_name column contains NaN.

In addition, let's filter on the NaN values in the organization_name:

In [14]:
df_data.loc[df_data.organization_name.isna()][col_subset]
Out[14]:
year category laureate_type birth_date full_name organization_name
1 1901 Literature Individual 1839-03-16 Sully Prudhomme NaN
3 1901 Peace Individual 1822-05-20 Frédéric Passy NaN
4 1901 Peace Individual 1828-05-08 Jean Henry Dunant NaN
7 1902 Literature Individual 1817-11-30 Christian Matthias Theodor Mommsen NaN
9 1902 Peace Individual 1843-05-21 Charles Albert Gobat NaN
... ... ... ... ... ... ...
932 2018 Peace Individual 1993-07-02 Nadia Murad NaN
942 2019 Literature Individual 1942-12-06 Peter Handke NaN
946 2019 Peace Individual 1976-08-15 Abiy Ahmed Ali NaN
954 2020 Literature Individual 1943-04-22 Louise Glück NaN
958 2020 Peace Organization NaN World Food Programme (WFP) NaN

255 rows × 6 columns

We also see that many prizes went to people who were not affiliated with a university or research institute. This includes many of the Literature and Peace prize winners.

Type Conversions¶

Challenge:

  • Convert the birth_date column to Pandas Datetime objects
  • Add a Column called share_pct which has the laureates' share as a percentage in the form of a floating-point number.

Convert Year and Birth Date to Datetime¶

In [15]:
df_data.birth_date = pd.to_datetime(df_data.birth_date)

Add a Column with the Prize Share as a Percentage¶

The .str.split() method returns a Series where values are lists of two items such as [1, 1]. Then we use .map() or .apply() to get hold of the first or second element. See StackOverflow post

In [16]:
numerator = df_data.prize_share.str.split('/').map(lambda x: x[0]).astype('int64')
denominator = df_data.prize_share.str.split('/').map(lambda x: x[1]).astype('int64')
df_data.insert(loc=5, column='share_pct', 
               value=numerator/denominator) # loc to specify where to insert the column

Alternatively, I could have used the expand parameter of the .split() method, e.g. separated_values = df_data.prize_share.str.split('/', expand=True)

To select specic part of a dataframe: df.iloc[row_start:row_end , col_start, col_end]

In [17]:
df_data.iloc[2:4, :6]
Out[17]:
year category prize motivation prize_share share_pct
2 1901 Medicine The Nobel Prize in Physiology or Medicine 1901 "for his work on serum therapy, especially its... 1/1 1.00
3 1901 Peace The Nobel Peace Prize 1901 NaN 1/2 0.50

Sidenote: The .T attribute of a df directly returns the transposed version, e.g. df.T

Plotly Donut Chart: Percentage of Male vs. Female Laureates¶

Challenge: Create a donut chart using plotly which shows how many prizes went to men compared to how many prizes went to women. What percentage of all the prizes went to women?

Note: In general with pandas (and numpy), we use the bitwise NOT ~ instead of ! or not (whose behaviour can't be overridden by types), e.g.

df_people = df_data.loc[~ df_data.sex.isna()]  # to exclude the organizations
In [18]:
prize_per_gender = df_data.groupby('sex', dropna=False).agg({'prize': 'count'})
prize_per_gender
Out[18]:
prize
sex
Female 58
Male 876
NaN 28

More elegant solution, from the course: df_data.sex.value_counts(dropna=False).to_frame()

In [19]:
fig1 = px.pie(prize_per_gender, values='prize', names=prize_per_gender.index, 
              title='Nobel Prize by gender', hole=0.5, height=500)
fig1.update_traces(textfont_size=15)
fig1.show()

Who were the first 3 Women to Win the Nobel Prize?¶

Challenge:

  • What are the names of the first 3 female Nobel laureates?
  • What did they win the prize for?
  • What do you see in their birth_country? Were they part of an organisation?
In [20]:
three_first_females = df_data.loc[df_data.sex=='Female']\
                      .sort_values('year').head(3)
three_first_females.full_name.values
Out[20]:
array(['Marie Curie, née Sklodowska',
       'Baroness Bertha Sophie Felicita von Suttner, née Countess Kinsky von Chinic und Tettau',
       'Selma Ottilia Lovisa Lagerlöf'], dtype=object)
In [21]:
three_first_females.drop(['organization_city', 'organization_country',\
                          'motivation', 'laureate_type', 'prize'], axis='columns')
Out[21]:
year category prize_share share_pct full_name birth_date birth_city birth_country birth_country_current sex organization_name ISO
18 1903 Physics 1/4 0.25 Marie Curie, née Sklodowska 1867-11-07 Warsaw Russian Empire (Poland) Poland Female NaN POL
29 1905 Peace 1/1 1.00 Baroness Bertha Sophie Felicita von Suttner, n... 1843-06-09 Prague Austrian Empire (Czech Republic) Czech Republic Female NaN CZE
51 1909 Literature 1/1 1.00 Selma Ottilia Lovisa Lagerlöf 1858-11-20 Mårbacka Sweden Sweden Female NaN SWE

Find the Repeat Winners¶

Challenge: Did some people get a Nobel Prize more than once? If so, who were they?

In [22]:
df_data.duplicated(subset=['full_name']).any()
Out[22]:
True
In [23]:
df_data.full_name.loc[df_data.duplicated(subset=['full_name'], 
                                         keep=False)].to_frame()
Out[23]:
full_name
18 Marie Curie, née Sklodowska
62 Marie Curie, née Sklodowska
89 Comité international de la Croix Rouge (Intern...
215 Comité international de la Croix Rouge (Intern...
278 Linus Carl Pauling
283 Office of the United Nations High Commissioner...
297 John Bardeen
306 Frederick Sanger
340 Linus Carl Pauling
348 Comité international de la Croix Rouge (Intern...
424 John Bardeen
505 Frederick Sanger
523 Office of the United Nations High Commissioner...

Number of Prizes per Category¶

Challenge:

  • In how many categories are prizes awarded?
  • Create a plotly bar chart with the number of prizes awarded by category.
  • Use the color scale called Aggrnyl to colour the chart, but don't show a color axis.
  • Which category has the most number of prizes awarded?
  • Which category has the fewest number of prizes awarded?
In [24]:
df_data.category.nunique()
Out[24]:
6
In [25]:
category_count = df_data.category.value_counts()
category_count.index
Out[25]:
Index(['Medicine', 'Physics', 'Chemistry', 'Peace', 'Literature', 'Economics'], dtype='object')
In [26]:
fig2 = px.bar(x=category_count.index, y=category_count.values, 
              height=500, title='Nobel Prizes per Category', 
              labels={'x': 'Category', 'y': 'Count'}, color_continuous_scale='Aggrnyl', 
              color=category_count.values)
fig2.update_layout(coloraxis_showscale=False)
fig2.show()

Challenge:

  • When was the first prize in the field of Economics awarded?
  • Who did the prize go to?
In [27]:
df_data.loc[df_data.category == 'Economics']\
       .sort_values('year').iloc[0].to_frame().T
Out[27]:
year category prize motivation prize_share share_pct laureate_type full_name birth_date birth_city birth_country birth_country_current sex organization_name organization_city organization_country ISO
393 1969 Economics The Sveriges Riksbank Prize in Economic Scienc... "for having developed and applied dynamic mode... 1/2 0.50 Individual Jan Tinbergen 1903-04-12 the Hague Netherlands Netherlands Male The Netherlands School of Economics Rotterdam Netherlands NLD

Male and Female Winners by Category¶

Challenge: Create a plotly bar chart that shows the split between men and women by category.

  • Hover over the bar chart. How many prizes went to women in Literature compared to Physics?
In [28]:
category_and_gender = df_data.groupby(['category', 'sex'], 
                                      as_index=False, dropna=False).agg({'prize': 'count'})
category_and_gender.rename(columns={'prize': 'Count', 'sex': 'Gender',\
                                    'category': 'Category'}, inplace=True)
category_and_gender.sort_values('Gender', ascending=False, 
                                inplace=True) # just to have the stacked columns with male first
category_and_gender.fillna(value='Organization', inplace=True)
category_and_gender
Out[28]:
Category Gender Count
1 Chemistry Male 179
3 Economics Male 84
5 Literature Male 101
7 Medicine Male 210
9 Peace Male 90
12 Physics Male 212
0 Chemistry Female 7
2 Economics Female 2
4 Literature Female 16
6 Medicine Female 12
8 Peace Female 17
11 Physics Female 4
10 Peace Organization 28
In [29]:
# fig3 = px.bar(category_and_gender, x='category', y='count', color='sex')
fig3 = px.bar(category_and_gender, x='Category', y='Count', color='Gender', 
              height=500, title='Nobel Prizes by Category and Gender')
fig3.update_xaxes(categoryorder='total descending')
fig3.show()

16 Nobel Prizes in Literature for females and 4 in Physics.

Number of Prizes Awarded Over Time¶

Now let's look at how things have changed over time. This will give us a chance to review what we learnt about creating charts with two y-axes in Matplotlib and generating arrays with NumPy.

Challenge: Are more prizes awarded recently than when the prize was first created? Show the trend in awards visually.

  • Count the number of prizes awarded every year.
  • Create a 5 year rolling average of the number of prizes (Hint: see previous lessons analysing Google Trends).
  • Using Matplotlib superimpose the rolling average on a scatter plot.
  • Show a tick mark on the x-axis for every 5 years from 1900 to 2020. (Hint: you'll need to use NumPy).

  • Use the named colours to draw the data points in dogerblue while the rolling average is coloured in crimson.

In [30]:
prizes_per_year = df_data.groupby('year').agg({'prize': 'count'})
prizes_per_year
Out[30]:
prize
year
1901 6
1902 7
1903 7
1904 6
1905 5
... ...
2016 11
2017 12
2018 13
2019 14
2020 12

117 rows × 1 columns

In [31]:
prizes_per_year_rolling = prizes_per_year.rolling(window=5).mean()
prizes_per_year_rolling
Out[31]:
prize
year
1901 NaN
1902 NaN
1903 NaN
1904 NaN
1905 6.20
... ...
2016 11.60
2017 12.00
2018 12.00
2019 12.20
2020 12.40

117 rows × 1 columns

Note: We can explicitly set where you want to tick marks with plt.xticks and np.arange() -> see StackOverflow post

In [32]:
x_ticks = np.arange(1900, 2021, 5)
x_ticks
Out[32]:
array([1900, 1905, 1910, 1915, 1920, 1925, 1930, 1935, 1940, 1945, 1950,
       1955, 1960, 1965, 1970, 1975, 1980, 1985, 1990, 1995, 2000, 2005,
       2010, 2015, 2020])
In [33]:
plt.figure(figsize=(16,8), dpi=200)
plt.xticks(ticks=x_ticks, fontsize=14, rotation=45)
plt.yticks(fontsize=14)
plt.title("Number of Nobel Prizes per Year", fontsize=18)
plt.xlabel('Year', fontsize=16)
plt.ylabel('Number of Prizes', fontsize=16)
plt.xlim(1900, 2020)

ax1 = plt.gca() # Get Current Axes
ax2 = plt.gca()

ax1.scatter(prizes_per_year.index, prizes_per_year.prize, c='dodgerblue', 
            alpha=0.6, s=100)
ax2.plot(prizes_per_year.index, prizes_per_year_rolling.prize, c='crimson', 
        linewidth=3)
ax2.grid(color='grey', linestyle='--')
plt.show()
  • Looking at the chart, did the first and second world wars have an impact on the number of prizes being given out?
  • What could be the reason for the trend in the chart?

Are More Prizes Shared Than Before?¶

Challenge: Investigate if more prizes are shared than before.

  • Calculate the average prize share of the winners on a year by year basis.
  • Calculate the 5 year rolling average of the percentage share.
  • Copy-paste the cell from the chart you created above.
  • Modify the code to add a secondary axis to your Matplotlib chart.
  • Plot the rolling average of the prize share on this chart.
  • See if you can invert the secondary y-axis to make the relationship even more clear.
In [34]:
prize_share = df_data.groupby('year').agg({'share_pct': 'mean'}).sort_values('year')
prize_share
Out[34]:
share_pct
year
1901 0.83
1902 0.71
1903 0.71
1904 0.83
1905 1.00
... ...
2016 0.55
2017 0.50
2018 0.46
2019 0.43
2020 0.50

117 rows × 1 columns

In [35]:
prize_share_rolling = prize_share.rolling(window=5).mean()
In [36]:
plt.figure(figsize=(16,8), dpi=200)
plt.xticks(ticks=x_ticks, fontsize=14, rotation=45)
plt.yticks(fontsize=14)
plt.title("Number of Nobel Prizes per Year", fontsize=18)
plt.xlabel('Year', fontsize=16)
plt.ylabel('Number of Prizes', fontsize=16, color='crimson')
plt.xlim(1900, 2020)

ax1 = plt.gca() # Get Current Axes
ax2 = plt.gca()
ax3 = plt.twinx()

ax1.scatter(prizes_per_year.index, prizes_per_year.prize, c='dodgerblue', 
            alpha=0.6, s=100)
ax2.plot(prizes_per_year.index, prizes_per_year_rolling, c='crimson', 
        linewidth=3)
ax2.grid(color='grey', linestyle='--')

# Third positional argument: fmt = '[marker][line][color]'
ax3.plot(prizes_per_year.index, prize_share_rolling.share_pct, '--g', # dashed line green
        linewidth=3)
ax3.set_ylim(0, 1)
ax3.set_ylabel('Prize Share', fontsize=16, color='green')
plt.setp(ax3.get_yticklabels(), fontsize=14)
ax3.invert_yaxis()

plt.show()

What do we see on the chart? Well, there is clearly an upward trend in the number of prizes being given out as more and more prizes are shared. Also, more prizes are being awarded from 1969 onwards because of the addition of the economics category. We also see that very few prizes were awarded during the first and second world wars. Note that instead of there being a zero entry for those years, we instead see the effect of the wards as missing blue dots.

The Countries with the Most Nobel Prizes¶

Challenge:

  • Create a Pandas DataFrame called top20_countries that has the two columns. The prize column should contain the total number of prizes won.

  • Is it best to use birth_country, birth_country_current or organization_country?
  • What are some potential problems when using birth_country or any of the others? Which column is the least problematic?
  • Then use plotly to create a horizontal bar chart showing the number of prizes won by each country. Here's what you're after: Link

  • What is the ranking for the top 20 countries in terms of the number of prizes?

Looking at our DataFrame there are actually 3 different columns to choose from for creating this ranking: birth_country, birth_country_current or organization_country. However, they each have certain problems and limitations.

If you look at the entries in the birth country, you'll see that some countries no longer exist! These include the Soviet Union or Czechoslovakia for example. Hence, using birth_country_current is better, since it has the country name which controls the city where the laureate was born. Now, notice that this does not determine the laureates' nationality since some globetrotting folks gave birth to their future Nobel laureate children while abroad. Also, people's nationalities can change as they emigrate and acquire different citizenship or get married and change citizenship. What this boils down to is that we will have to be clear about the assumptions that we will make in the upcoming analysis.

In [37]:
top20_countries = df_data.groupby('birth_country_current', as_index=False)\
                  .agg({'prize': 'count'}).sort_values('prize', ascending=False).head(20)
top20_countries
Out[37]:
birth_country_current prize
74 United States of America 281
73 United Kingdom 105
26 Germany 84
25 France 57
67 Sweden 29
40 Japan 27
57 Poland 27
61 Russia 26
11 Canada 20
68 Switzerland 19
39 Italy 19
3 Austria 18
51 Netherlands 18
20 Denmark 12
13 China 12
54 Norway 12
2 Australia 10
7 Belgium 9
33 India 9
65 South Africa 9
In [38]:
fig4 = px.bar(top20_countries, x='prize', y='birth_country_current', color='prize', 
              orientation='h', height=500, title='Top 20 Countries by Number of Nobel Prizes', 
              color_continuous_scale='Viridis')
fig4.update_layout(xaxis_title='Number of Prizes', yaxis_title='Country', 
                   coloraxis_showscale=False)
fig4.update_yaxes(categoryorder='total ascending')
fig4.show()

The United States has a massive number of prizes by this measure. The UK and Germany are in second and third place respectively.

Use a Choropleth Map to Show the Number of Prizes Won by Country¶

  • Create this choropleth map using the plotly documentation:
  • Experiment with plotly's available colours. I quite like the sequential colour matter on this map.

Hint: You'll need to use a 3 letter country code for each country.

In [39]:
prizes_per_ISO = df_data.groupby('ISO', as_index=False)\
                 .agg({'prize': 'count'}).sort_values('prize', ascending=False)
prizes_per_ISO
Out[39]:
ISO prize
73 USA 281
28 GBR 105
19 DEU 84
27 FRA 57
68 SWE 29
... ... ...
50 LVA 1
51 MAR 1
52 MDG 1
54 MKD 1
78 ZWE 1

79 rows × 2 columns

In [40]:
fig5 = px.choropleth(prizes_per_ISO, locations='ISO', color='prize', 
                     height=700, color_continuous_scale='matter', 
                     labels={'ISO': 'Country', 'prize': "Nb of Nobel Prizes"})
# fig4.update_layout(coloraxis_showscale=False)
fig5.show()

In Which Categories are the Different Countries Winning Prizes?¶

Challenge: See if you can divide up the plotly bar chart you created above to show the which categories made up the total number of prizes. Here's what you're aiming for.

  • In which category are Germany and Japan the weakest compared to the United States?
  • In which category does Germany have more prizes than the UK?
  • In which categories does France have more prizes than Germany?
  • Which category makes up most of Australia's nobel prizes?
  • Which category makes up half of the prizes in the Netherlands?
  • Does the United States have more prizes in Economics than all of France? What about in Physics or Medicine?

The hard part is preparing the data for this chart!

Hint: Take a two-step approach. The first step is grouping the data by country and category. Then you can create a DataFrame that looks something like this:

Problem: If we groupby country and category first, we can't really select the top20 anymore. So we first need to fiter, from the original df, the top 20 countries. Then we can groupby.

The method .isin() allows to check in values are in another Series.

In [41]:
top20_countries_categ = df_data.loc[df_data.birth_country_current.isin(top20_countries.birth_country_current)]\
                        .groupby(['birth_country_current', 'category'], as_index=False).agg({'prize': 'count'})\
                        .sort_values('prize', ascending=False)
top20_countries_categ
Out[41]:
birth_country_current category prize
106 United States of America Medicine 78
108 United States of America Physics 70
103 United States of America Chemistry 55
104 United States of America Economics 49
100 United Kingdom Medicine 28
... ... ... ...
18 Canada Peace 1
62 Netherlands Peace 1
28 Denmark Peace 1
71 Poland Economics 1
0 Australia Chemistry 1

109 rows × 3 columns

In [42]:
fig6 = px.bar(top20_countries_categ, x='prize', y='birth_country_current', color='category', 
              orientation='h', height=500, title='Top 20 Countries by Number of Nobel Prizes and Category', 
              labels={'prize': 'Number of Prizes', 'birth_country_current': 'Country', 'category': 'Category'})
fig6.update_yaxes(categoryorder='total ascending')
fig6.show()

Splitting the country bar chart by category allows us to get a very granular look at the data and answer a whole bunch of questions. For example, we see is that the US has won an incredible proportion of the prizes in the field of Economics. In comparison, Japan and Germany have won very few or no economics prize at all. Also, the US has more prizes in physics or medicine alone than all of France's prizes combined. On the chart, we also see that Germany won more prizes in physics than the UK and that France has won more prizes in peace and literature than Germany, even though Germany has been awarded a higher total number of prizes than France.

Number of Prizes Won by Each Country Over Time¶

  • When did the United States eclipse every other country in terms of the number of prizes won?
  • Which country or countries were leading previously?
  • Calculate the cumulative number of prizes won by each country in every year. Again, use the birth_country_current of the winner to calculate this.
  • Create a plotly line chart where each country is a coloured line.
In [43]:
# making sure the dataset is sorted by year and resetting index
df_data_sorted = df_data.sort_values('year').reset_index(drop=True)
cumul_prize_country_series = df_data_sorted.groupby('birth_country_current').cumcount()
# N.B. using .copy() below to make it a separate copy, and avoid warning when creating new column
df_cumul_country = df_data_sorted[['year', 'birth_country_current']].copy()
df_cumul_country['Cumulated Number of Nobel Prizes'] = cumul_prize_country_series
# The cumul. count starts with 0, whereas first occurrence should be 1 --> adding 1 to all
df_cumul_country['Cumulated Number of Nobel Prizes'] = df_cumul_country['Cumulated Number of Nobel Prizes'] + 1
df_cumul_country.rename(columns={'year': 'Year', 'birth_country_current': 'Country'}, inplace=True)
df_cumul_country.sort_values(['Country', 'Year'], inplace=True) # sorting by country then by year
df_cumul_country = df_cumul_country.dropna().reset_index(drop=True)
df_cumul_country
Out[43]:
Year Country Cumulated Number of Nobel Prizes
0 1957 Algeria 1.00
1 1997 Algeria 2.00
2 1936 Argentina 1.00
3 1947 Argentina 2.00
4 1980 Argentina 3.00
... ... ... ...
929 2020 United States of America 281.00
930 1980 Venezuela 1.00
931 1973 Vietnam 1.00
932 2011 Yemen 1.00
933 1960 Zimbabwe 1.00

934 rows × 3 columns

Note on copying DataFrames: It's interesting to note that when creating new variables containing Series or DataFrame, they actually still point to the original data, so that if the original data is changed, it will be reflected on all DataFrames and Series depending from it. At least, that's how I understood it.

If we use .copy() though, it will create an entirely separate entity, detached from the original.

This is one way to avoid the "SettingWithCopyWarning" when creating a new column. Other ways would probably include adding the column in the original df instead of a copy, or using the .insert() method.

Remarks on the created DataFrame: Contrary to the course solution, with this method and because of the way the .cumcount() Series is created, we keep the exact same number of rows as the original df, which means also the rows where the Country is NaN (if organization instead of people).

Also, if a country wins more than one prize in a year, we have several entries in our df, for each prize, which can make it tricky for the graph when sorting by country, because it has two or more points for the same year. In the course solution. there is only one entry per country per year, i.e. the cumulative count/sum jumps to 2 more directly.

This explains why, even if we drop the NaN values in our df, we still have way more rows than in the solution.

Note on sort_values(): We can actually sort by more than one column, e.g sort by country, then by year.

Course solution: First we count the number of prizes by country by year. It's important we keep the new indices for next step (so not using as_index=False this time).

In [44]:
prize_by_year = df_data.groupby(by=['birth_country_current', 'year']).agg({'prize': 'count'})
prize_by_year
Out[44]:
prize
birth_country_current year
Algeria 1957 1
1997 1
Argentina 1936 1
1947 1
1980 1
... ... ...
United States of America 2020 7
Venezuela 1980 1
Vietnam 1973 1
Yemen 2011 1
Zimbabwe 1960 1

627 rows × 1 columns

Then we can use the cumulative sum, by country, instead of the cumulative count. And in order not to lose the country column, we keep it as an index and we must specify to group by an index level.

In [45]:
cumulative_prizes = prize_by_year.groupby(level=[0]).cumsum()
cumulative_prizes
Out[45]:
prize
birth_country_current year
Algeria 1957 1
1997 2
Argentina 1936 1
1947 2
1980 3
... ... ...
United States of America 2020 281
Venezuela 1980 1
Vietnam 1973 1
Yemen 2011 1
Zimbabwe 1960 1

627 rows × 1 columns

Only then, we can reset the index, for better use in the graph, which doesn't know how to deal with Multilevel Index.

In [46]:
cumulative_prizes.reset_index(inplace=True)
cumulative_prizes
Out[46]:
birth_country_current year prize
0 Algeria 1957 1
1 Algeria 1997 2
2 Argentina 1936 1
3 Argentina 1947 2
4 Argentina 1980 3
... ... ... ...
622 United States of America 2020 281
623 Venezuela 1980 1
624 Vietnam 1973 1
625 Yemen 2011 1
626 Zimbabwe 1960 1

627 rows × 3 columns

In [47]:
fig7 = px.line(df_cumul_country, x='Year', y='Cumulated Number of Nobel Prizes', 
               color='Country', height=500)
fig7.show()
In [48]:
fig8 = px.line(cumulative_prizes, x='year', y='prize', 
               color='birth_country_current', height=500,
               labels={'year': 'Year', 'prize': 'Cumulated Number of Prizes',
                       'birth_country_current': 'Country'})
fig8.show()

What we see is that the United States really started to take off after the Second World War which decimated Europe. Prior to that, the Nobel prize was pretty much a European affair. Very few laureates were chosen from other parts of the world. This has changed dramatically in the last 40 years or so. There are many more countries represented today than in the early days. Interestingly we also see that the UK and Germany traded places in the 70s and 90s on the total number of prizes won. Sweden being 5th place pretty consistently over many decades is quite interesting too. Perhaps this reflects a little bit of home bias?

What are the Top Research Organisations?¶

Many Nobel laureates are affiliated with a university, a laboratory, or a research organisation (apart from Literature and Peace prize winners as we've seen). But the world is a big place. Which research institutions had the most Nobel laureates working there at the time of making the discovery?

Challenge: Create a bar chart showing the organisations affiliated with the Nobel laureates. It should looks something like this.

  • Which organisations make up the top 20?
  • How many Nobel prize winners are affiliated with the University of Chicago and Harvard University?
In [49]:
# Two homemade methods to just return the unique grouping names that one can expect when using groupby
# Both methods return the exact same result
orgs = df_data[['organization_country', 'organization_city', 'organization_name']].dropna().drop_duplicates()\
       .reset_index(drop=True).sort_values(['organization_country', 'organization_city', 'organization_name'])
df_data.groupby(['organization_country', 'organization_city', 'organization_name'], as_index=False)\
       .agg({'prize': 'count'}).drop('prize', axis='columns')
Out[49]:
organization_country organization_city organization_name
0 Argentina Buenos Aires Institute for Biochemical Research
1 Argentina Buenos Aires Instituto de Biologia y Medicina Experimental ...
2 Australia Canberra Australian National University
3 Australia Melbourne Walter and Eliza Hall Institute for Medical Re...
4 Australia Nedlands NHMRC Helicobacter pylori Research Laboratory,...
... ... ... ...
286 United States of America West Lafayette, IN Purdue University
287 United States of America Wilmington, DE Du Pont
288 United States of America Woods Hole, MA Marine Biological Laboratory (MBL)
289 United States of America Worcester, MA University of Massachusetts Medical School
290 United States of America Yorktown Heights, NY IBM Thomas J. Watson Research Center

291 rows × 3 columns

In [50]:
orgs.loc[orgs.organization_name == 'University of California']
Out[50]:
organization_country organization_city organization_name
73 United States of America Berkeley, CA University of California
205 United States of America Irvine, CA University of California
116 United States of America Los Angeles, CA University of California
124 United States of America San Diego, CA University of California
190 United States of America San Francisco, CA University of California
214 United States of America Santa Barbara, CA University of California

This is a bias not explained in the next analysis, when sorting by organization_name, the University of California is actually divided between 6 different sites...

In [51]:
prize_per_organization = df_data.groupby('organization_name').agg({'prize': 'count'})
prize_per_organization.reset_index(inplace=True)
# Using lists to sort by nb of prizes then by alphabet
prize_per_organization.sort_values(['prize', 'organization_name'], ascending=[False, True], inplace=True)
top20_organizations = prize_per_organization.head(20)
top20_organizations
Out[51]:
organization_name prize
196 University of California 40
68 Harvard University 29
167 Stanford University 23
117 Massachusetts Institute of Technology (MIT) 21
198 University of Chicago 20
197 University of Cambridge 18
26 California Institute of Technology (Caltech) 17
38 Columbia University 17
146 Princeton University 15
119 Max-Planck-Institut 13
152 Rockefeller University 13
222 University of Oxford 12
111 MRC Laboratory of Molecular Biology 10
258 Yale University 9
12 Bell Laboratories 8
40 Cornell University 8
67 Harvard Medical School 7
79 Institut Pasteur 7
109 London University 7
163 Sorbonne University 7
In [52]:
fig9 = px.bar(top20_organizations, x='prize', y='organization_name', color='prize', 
              orientation='h', height=500, title='Top 20 Research Institutions by Number of Nobel Prizes', 
              color_continuous_scale='Viridis')
fig9.update_layout(xaxis_title='Number of Prizes', yaxis_title='Institution', 
                   coloraxis_showscale=False)
fig9.update_yaxes(categoryorder='total ascending')
fig9.show()

Which Cities Make the Most Discoveries?¶

Each research organisation is located in a particular city. Are some cities hot spots for scientific discoveries?

Where do major discoveries take place?

Challenge:

  • Create another plotly bar chart graphing the top 20 organisation cities of the research institutions associated with a Nobel laureate.
  • Where is the number one hotspot for discoveries in the world?
  • Which city in Europe has had the most discoveries?
In [53]:
prize_per_city = df_data.groupby('organization_city', as_index=False).agg({'prize': 'count'}).sort_values('prize', ascending=False)
top20_cities = prize_per_city.head(20)
top20_cities
Out[53]:
organization_city prize
34 Cambridge, MA 50
121 New York, NY 45
33 Cambridge 31
92 London 27
128 Paris 25
156 Stanford, CA 24
12 Berkeley, CA 21
39 Chicago, IL 20
135 Princeton, NJ 19
129 Pasadena, CA 17
126 Oxford 12
13 Berlin 12
157 Stockholm 10
108 Moscow 10
71 Heidelberg 9
110 Munich 9
23 Boston, MA 9
119 New Haven, CT 9
78 Ithaca, NY 8
65 Göttingen 8
In [54]:
fig10 = px.bar(top20_cities, x='prize', y='organization_city', color='prize', 
              orientation='h', height=500, title='Top 20 Cities by Number of Nobel Prizes', 
              color_continuous_scale='Plasma', 
              labels={'organization_city': 'City', 'prize': 'Number of Prizes'})
fig10.update_layout(coloraxis_showscale=False)
fig10.update_yaxes(categoryorder='total ascending')
fig10.show()

Where are Nobel Laureates Born? Chart the Laureate Birth Cities¶

Challenge:

  • Create a plotly bar chart graphing the top 20 birth cities of Nobel laureates.
  • Use a named colour scale called Plasma for the chart.
  • What percentage of the United States prizes came from Nobel laureates born in New York?
  • How many Nobel laureates were born in London, Paris and Vienna?
  • Out of the top 5 cities, how many are in the United States?
In [55]:
prize_per_birth_city = df_data.groupby('birth_city', as_index=False).agg({'prize': 'count'}).sort_values('prize', ascending=False)
top20_birth_cities = prize_per_birth_city.head(20)
In [56]:
fig11 = px.bar(top20_birth_cities, x='prize', y='birth_city', color='prize', 
              orientation='h', height=500, title='Top 20 Birth Cities by Number of Nobel Prizes', 
              color_continuous_scale='Plasma', 
              labels={'birth_city': 'City', 'prize': 'Number of Prizes'})
fig11.update_layout(coloraxis_showscale=False)
fig11.update_yaxes(categoryorder='total ascending')
fig11.show()

A higher population definitely means that there's a higher chance of a Nobel laureate to be born there. New York, Paris, and London are all very populous. However, Vienna and Budapest are not and still produced many prize winners. That said, much of the ground-breaking research does not take place in big population centres, so the list of birth cities is quite different from the list above. Cambridge Massachusets, Stanford, Berkely and Cambridge (UK) are all the places where many discoveries are made, but they are not the birthplaces of laureates.

In [57]:
total_prizes_birth_USA = len(df_data.loc[df_data.birth_country_current == 'United States of America'])
total_prizes_birth_NYC = len(df_data.loc[df_data.birth_city == 'New York, NY'])
pct_NYC = total_prizes_birth_NYC/total_prizes_birth_USA*100
print(f"{pct_NYC:.0f}% of the United States prizes came from Nobel laureates born in New York.")
19% of the United States prizes came from Nobel laureates born in New York.

Plotly Sunburst Chart: Combine Country, City, and Organisation¶

Challenge:

  • Create a DataFrame that groups the number of prizes by organisation.
  • Then use the plotly documentation to create a sunburst chart
  • Click around in your chart, what do you notice about Germany and France?

Here's what you're aiming for: link

Note: The sunburst graph is interactive! We can click on the elements.

In [58]:
prize_orgs = df_data.groupby(['organization_country', 'organization_city', 'organization_name'], 
                as_index=False).agg({'prize': 'count'}).rename(columns={'prize': 'nb_of_nobel_prizes'})
fig12 = px.sunburst(prize_orgs, path=['organization_country', 'organization_city', 'organization_name'],
                    values='nb_of_nobel_prizes', height=1200,
                    title='Nobel Prizes distributed by Research Institutions')
fig12.show()

Patterns in the Laureate Age at the Time of the Award¶

How Old Are the Laureates When the Win the Prize?

Challenge: Calculate the age of the laureate in the year of the ceremony and add this as a column called winning_age to the df_data DataFrame. Hint: you can use this to help you.

In [59]:
df_data.insert(loc=8, column='winning_age', value=df_data.year - df_data.birth_date.dt.year)
df_data.iloc[:5,:10]  # df.iloc[row_start:row_end, column_start:column_end]
Out[59]:
year category prize motivation prize_share share_pct laureate_type full_name winning_age birth_date
0 1901 Chemistry The Nobel Prize in Chemistry 1901 "in recognition of the extraordinary services ... 1/1 1.00 Individual Jacobus Henricus van 't Hoff 49.00 1852-08-30
1 1901 Literature The Nobel Prize in Literature 1901 "in special recognition of his poetic composit... 1/1 1.00 Individual Sully Prudhomme 62.00 1839-03-16
2 1901 Medicine The Nobel Prize in Physiology or Medicine 1901 "for his work on serum therapy, especially its... 1/1 1.00 Individual Emil Adolf von Behring 47.00 1854-03-15
3 1901 Peace The Nobel Peace Prize 1901 NaN 1/2 0.50 Individual Frédéric Passy 79.00 1822-05-20
4 1901 Peace The Nobel Peace Prize 1901 NaN 1/2 0.50 Individual Jean Henry Dunant 73.00 1828-05-08

Who were the oldest and youngest winners?¶

Challenge:

  • What are the names of the youngest and oldest Nobel laureate?
  • What did they win the prize for?
  • What is the average age of a winner?
  • 75% of laureates are younger than what age when they receive the prize?
In [60]:
df_data.loc[df_data.winning_age == df_data.winning_age.min()].iloc[:, :14]
Out[60]:
year category prize motivation prize_share share_pct laureate_type full_name winning_age birth_date birth_city birth_country birth_country_current sex
885 2014 Peace The Nobel Peace Prize 2014 "for their struggle against the suppression of... 1/2 0.50 Individual Malala Yousafzai 17.00 1997-07-12 Mingora Pakistan Pakistan Female

The youngest Nobel laureate is Malala Yousafzai at 17 years old.

In [61]:
df_data.loc[df_data.winning_age == df_data.winning_age.max()].iloc[:, :14]
Out[61]:
year category prize motivation prize_share share_pct laureate_type full_name winning_age birth_date birth_city birth_country birth_country_current sex
937 2019 Chemistry The Nobel Prize in Chemistry 2019 “for the development of lithium-ion batteries” 1/3 0.33 Individual John Goodenough 97.00 1922-07-25 Jena Germany Germany Male

The youngest Nobel laureate is John Goodenough at 97 years old.

In [62]:
print(f"The average age of a Nobel prize winner is {df_data.winning_age.mean():.0f} years old.")
The average age of a Nobel prize winner is 60 years old.

Descriptive Statistics for the Laureate Age at Time of Award¶

  • Calculate the descriptive statistics for the age at the time of the award.
  • Then visualise the distribution in the form of a histogram using Seaborn's .histplot() function.
  • Experiment with the bin size. Try 10, 20, 30, and 50.
In [63]:
df_data.winning_age.to_frame().describe()
Out[63]:
winning_age
count 934.00
mean 59.95
std 12.62
min 17.00
25% 51.00
50% 60.00
75% 69.00
max 97.00

75% of the laureates are less than 69 years old.

In [64]:
prize_per_age = df_data.winning_age.to_frame().dropna().sort_values('winning_age').reset_index(drop=True)
prize_per_age.winning_age = prize_per_age.winning_age.astype('int64')
prize_per_age
Out[64]:
winning_age
0 17
1 25
2 25
3 31
4 31
... ...
929 89
930 89
931 90
932 96
933 97

934 rows × 1 columns

In [65]:
np.random.randint(0, 5, 10)
Out[65]:
array([3, 3, 1, 1, 2, 2, 1, 0, 0, 0])
In [66]:
hist_x_ticks = np.arange(10, 101, 10)
hist_x_ticks
Out[66]:
array([ 10,  20,  30,  40,  50,  60,  70,  80,  90, 100])
In [67]:
plt.figure(figsize=(8,4), dpi=200)
# using ticks style (so xticks appear) and overriding background color
with sns.axes_style('ticks', rc={'axes.facecolor': '#EAEAF2'}):
    hist = sns.histplot(x=prize_per_age.winning_age, bins=30)
    hist.set(xlabel='Age of Nobel Prize Laureates')
    hist.grid(color='white', linestyle='--', axis='y')
# plt.xticks(ticks=hist_x_ticks)

plt.show()

Options for the histogram include:

  • forcing a number of bins, e.g. bins=20
  • showing all discrete values: discrete=True
  • using a self-made tick range: plt.xticks(ticks=np.arange(10, 101, 10))

Tips:

  • I can force to display several elements in the same block of code, without using print: display()
  • I can select the N largest or N smallest of a DataFrame: .nlargest(), .nsmallest()
    display(df_data.nlargest(n=1, columns='winning_age'))
    display(df_data.nsmallest(n=1, columns='winning_age'))
    

Age at Time of Award throughout History¶

Are Nobel laureates being nominated later in life than before? Have the ages of laureates at the time of the award increased or decreased over time?

Challenge

  • Use Seaborn to create a .regplot with a trendline.
  • Set the lowess parameter to True to show a moving average of the linear fit.
  • According to the best fit line, how old were Nobel laureates in the years 1900-1940 when they were awarded the prize?
  • According to the best fit line, what age would it predict for a Nobel laureate in 2020?
In [68]:
age_per_year = df_data[['year', 'winning_age']].dropna().reset_index(drop=True)
age_per_year
Out[68]:
year winning_age
0 1901 49.00
1 1901 62.00
2 1901 47.00
3 1901 79.00
4 1901 73.00
... ... ...
929 2020 85.00
930 2020 71.00
931 2020 55.00
932 2020 68.00
933 2020 89.00

934 rows × 2 columns

Instead of a straight regression line, using the lowess parameter gives us a moving linear fit ("moving average of the linear fit").

Alternatively, we can also set the order of the polynomial fit to a higher degree, e.g. order=2

In [69]:
plt.figure(figsize=(8,4), dpi=200)
with sns.axes_style('darkgrid'):
    # N.B. with regplot, the styles of the line and of the scatter must be set separately
    # in these two parameters: scatter_kws and line_kws
    reg = sns.regplot(x=age_per_year.year, y=age_per_year.winning_age,
          lowess=True,    # must install the 'statsmodels' package to use this parameter
          # order=2,  
          scatter_kws={'s':8, 'alpha': 0.4, 'color': '#2f4b7c'}, # s for the size of the dots
          line_kws={'color': 'brown', 'linewidth': 2})
    reg.set(xlabel='Year of Nobel Prize Award', ylabel='Age of Prize Laureate')
plt.show()

Using the lowess parameter allows us to plot a local linear regression. This means the best fit line is still linear, but it's more like a moving average which gives us a non-linear shape across the entire series. This is super neat because it clearly shows how the Nobel laureates are getting their award later and later in life. From 1900 to around 1950, the laureates were around 55 years old, but these days they are closer to 70 years old when they get their award! The other thing that we see in the chart is that in the last 10 years the spread has increased. We've had more very young and very old winners. In 1950s/60s winners were between 30 and 80 years old. Lately, that range has widened.

Winning Age Across the Nobel Prize Categories¶

How does the age of laureates vary by category?

  • Use Seaborn's .boxplot() to show how the mean, quartiles, max, and minimum values vary across categories. Which category has the longest "whiskers"?
  • In which prize category are the average winners the oldest?
  • In which prize category are the average winners the youngest?
In [70]:
age_by_category = df_data[['category', 'winning_age']].dropna().reset_index(drop=True)\
                  .rename(columns={'category': 'Nobel Prize Category', 'winning_age': 'Age of Prize Laureate'})
age_by_category
Out[70]:
Nobel Prize Category Age of Prize Laureate
0 Chemistry 49.00
1 Literature 62.00
2 Medicine 47.00
3 Peace 79.00
4 Peace 73.00
... ... ...
929 Medicine 85.00
930 Medicine 71.00
931 Physics 55.00
932 Physics 68.00
933 Physics 89.00

934 rows × 2 columns

In [71]:
plt.figure(figsize=(8,4), dpi=200)
with sns.axes_style('darkgrid'):
    box = sns.boxplot(data=age_by_category,
                      x='Nobel Prize Category', y='Age of Prize Laureate',
                      # hue='Nobel Prize Category' # no need for it, colors already applied
                     )
# plt.legend([],[], frameon=False)  # to remove the legend, if there is one
plt.show()

The box plot shows us the mean, the quartiles, the maximum and the minimum values. It raises an interesting question: "Are peace prize winners really older than physics laureates?".

Note: can also do the box plot with Plotly Express px

Challenge

.lmplot() combines regplot() and FacetGrid. It is intended as a convenient interface to fit regression models across conditional subsets of a dataset.

  • Now use Seaborn's .lmplot() and the row parameter to create 6 separate charts for each prize category. Again set lowess to True.
  • What are the winning age trends in each category?
  • Which category has the age trending up and which category has the age trending down?
  • Is this .lmplot() telling a different story from the .boxplot()?
  • Create another chart with Seaborn. This time use .lmplot() to put all 6 categories on the same chart using the hue parameter.
In [72]:
age_per_year_and_category = df_data[['year', 'category', 'winning_age']].dropna().reset_index(drop=True)\
                            .rename(columns={'year': 'Year', 'category': 'Category', 'winning_age': 'Age of Prize Laureate'})
age_per_year_and_category
Out[72]:
Year Category Age of Prize Laureate
0 1901 Chemistry 49.00
1 1901 Literature 62.00
2 1901 Medicine 47.00
3 1901 Peace 79.00
4 1901 Peace 73.00
... ... ... ...
929 2020 Medicine 85.00
930 2020 Medicine 71.00
931 2020 Physics 55.00
932 2020 Physics 68.00
933 2020 Physics 89.00

934 rows × 3 columns

In [73]:
plt.figure(dpi=200)
sns.set_style('darkgrid')
sns.lmplot(data=age_per_year_and_category,
                x='Year', y='Age of Prize Laureate',
                hue='Category', lowess=True,
                scatter_kws = {'alpha': 0.5},
                line_kws={'linewidth': 3},
                height=8, aspect=1.5) # Height (in inches), and aspect ratio, so that aspect * height gives the width
plt.title='Trend of Nobel Prize Laureate Age per Category'
plt.show()
<Figure size 1280x960 with 0 Axes>

We can put of the graphs (facets) in individual rows with the row='Category' argument, but it's a bit of a waste of space. Instead, we put them in columns, but we say to wrap to a new row after 2 graphs: col='Category', col_wrap=2

In [74]:
plt.figure(dpi=200)
sns.set_style('darkgrid')
sns.lmplot(data=age_per_year_and_category,
           scatter_kws = {'alpha': 0.5},
           line_kws={'linewidth': 3},
           x='Year', y='Age of Prize Laureate',
           col='Category', hue='Category', lowess=True,
           col_wrap=2,
           height=4, aspect=2)  # of each facet
plt.show()
<Figure size 1280x960 with 0 Axes>

We see that winners in physics, chemistry, and medicine have gotten older over time. The ageing trend is strongest for physics. The average age used to be below 50, but now it's over 70. Economics, the newest category, is much more stable in comparison. The peace prize shows the opposite trend where winners are getting younger! As such, our scatter plots showing the best fit lines over time and our box plot of the entire dataset can tell very different stories!

Learning Points & Summary¶

In this lesson, we reviewed many concepts that we've covered previously, including:

  • Create a rolling average to smooth out time-series data and show a trend, e.g. prizes_per_year.rolling(window=5).mean()

  • How to use .value_counts(), .groupby(), .merge(), .sort_values() and .agg(). Note: sometimes .value_counts() more elegant than .groupby().agg({'column': 'count'})

  • Summary of all the import statements:

import pandas as pd
import numpy as np
import plotly.express as px
import seaborn as sns
import matplotlib.pyplot as plt
  • Using the .transpose() or the .T to get the transposed view of a DataFrame

  • Using .insert(): df_data.insert(loc=5, column='share_pct', value=numerator/denominator)

  • Using .drop([labels_list], axis='columns') when dropping columns (and axis="rows" by default otherwise)

  • Using .duplicated() with parameters, e.g. df_data.duplicated(subset=['full_name'], keep=False)

  • Using .nunique() to find number of unique values in a Series

  • Using .reset_index, for example after dropping values: df_cumul_country.dropna().reset_index(drop=True)

  • Using .dt to access datetime attributes or methods, applied to a Series, e.g. df.birth_year.dt.year

  • Using .describe() to return "Descriptive Statistics" about a DataFrame

  • Using astype() to convert a Series to a datatype, e.g. .astype('int64')

  • Creating plotly donut charts with .pie() and some parameters like hole=0.5, height=500, title=

  • Creating plotly bar charts with .bar() and parameters like labels={'old': 'new'}, color_continuous_scale='Aggrnyl', color=category, as well as orientation='h' to set the bars horizontal (consider the new x and y params in this case)

  • Customizing plotly charts with .update_traces() and parameters like textfont_size=15, coloraxis_showscale=False

  • Changing the order of the bars in plotly with fig3.update_xaxes(categoryorder='total descending')

  • Creating regression plots with seaborn and .regplot() and changing the style of the scatter plot with scatter_kws={} and of the reg line with line_kws={}, e.g.

scatter_kws={'s':8, 'alpha': 0.4, 'color': '#2f4b7c'}, # s for the size of the dots
line_kws={'color': 'brown', 'linewidth': 2}
  • Using matplotlib mainly for graphs with several y-axes with .plot() and .scatter, and using all the parameters and methods such as:
plt.figure(figsize=(16,8), dpi=200)    # setting the size and resolution for display
plt.xticks(ticks=x_ticks, fontsize=14, rotation=45)  # styling the x-ticks
plt.yticks(fontsize=14)                              # styling the y-ticks
plt.title("Number of Nobel Prizes per Year", fontsize=18) # give a title
plt.xlabel('Year', fontsize=16)                              # styling the x-label
plt.ylabel('Number of Prizes', fontsize=16, color='crimson') # styling the x-label
plt.xlim(1900, 2020)                                 # setting the x-axis boundaries

ax1 = plt.gca() # Get Current Axes
ax2 = plt.twinx() # Creating a second y-axis (which shares the same x-axis)

ax1.scatter(prizes_per_year.index, prizes_per_year.prize, c='dodgerblue', 
            alpha=0.6, s=100) # c for color, alpha transparency, s marker size
ax1.plot(prizes_per_year.index, prizes_per_year_rolling, c='crimson', 
        linewidth=3)          # increasing linewidth
ax1.grid(color='grey', linestyle='--')  # adding a grid

# Third positional argument in plot is fmt = '[marker][line][color]'
ax2.plot(prizes_per_year.index, prize_share_rolling.share_pct, '--g', # dashed line green
        linewidth=3)
ax2.set_ylim(0, 1)
ax2.set_ylabel('Prize Share', fontsize=16, color='green')
plt.setp(ax3.get_yticklabels(), fontsize=14)
ax2.invert_yaxis()
plt.show()

In addition, we learned many new things too. We looked at how to:

  • Using shell commands with the ! or % characters (see saved links in folder "IPython magic commands ! $ % etc - run shell commands"), e.g.
!pip install --upgrade plotly
  • Create a subset separately and then re-use it, e.g.
col_subset = ['year','category', 'laureate_type',
              'birth_date','full_name', 'organization_name']
df_data.loc[df_data.birth_date.isna()][col_subset]  # we can reuse a subset like this!
  • The .str.split() method returns a Series where values are lists of two items such as [1, 1] (unless we use the expand=True parameter). Then we use .map() or .apply() and give it a lambda function to get hold of the first or second element. See StackOverflow post. E.g.
numerator = df_data.prize_share.str.split('/').map(lambda x: x[0]).astype('int64')
denominator = df_data.prize_share.str.split('/').map(lambda x: x[1]).astype('int64')
  • Use negation in the .loc[] conditions (in general with pandas and numpy, we use the bitwise NOT ~ instead of ! or not):
df_people = df_data.loc[not df_data.sex.isna()]  # where isna not True -> NOK
df_people = df_data.loc[~ df_data.sex.isna()]  # OK
  • Select a specific part of a DataFrame using .iloc[] -> df.iloc[row_start:row_end , col_start, col_end], e.g.
df_data.iloc[2:4, :6]
  • Use .fillna() to replace NaN values with a specific value, e.g. category_and_gender.fillna(value='Organization', inplace=True)

  • Using Numpy .arange() method to create a custom array of ticks for our graph. Also knowing that np.random.randint(0, 5, 10) returns 10 random values between 0 included and 5 excluded.

x_ticks = np.arange(1900, 2021, 5)
plt.xticks(ticks=x_ticks, fontsize=14, rotation=45)
  • Use .isin() to check values of a Series that are in another Series, e.g.
df_data.loc[df_data.birth_country_current.isin(top20_countries.birth_country_current)]
  • Use the .cumcount() method wisely. It only returns a Series, and the count starts at 0, e.g.
cumul_prize_country_series = df_data_sorted.groupby('birth_country_current').cumcount()
df_cumul_country['Cumulated Number of Nobel Prizes'] = df_cumul_country['Cumulated Number of Nobel Prizes'] + 1
  • Use the .copy() method when/if getting a warning about trying to create a column, e.g.
df_cumul_country = df_data_sorted[['year', 'birth_country_current']].copy()
  • Use .count() and then .cumsum(), and the ability to group by an index level, as an alternative to .cumcount(), e.g.
prize_by_year = df_data.groupby(by=['birth_country_current', 'year']).agg({'prize': 'count'})
cumulative_prizes = prize_by_year.groupby(level=[0]).cumsum()
  • Use float formatting in an f-string, e.g. print(f"blabla the age is {df_data.winning_age.mean():.0f} years old.")

  • Force-display several elements in the same block of code, without using print with : display()

  • Select the N largest or N smallest of a DataFrame with .nlargest(), .nsmallest()
display(df_data.nlargest(n=1, columns='winning_age'))
display(df_data.nsmallest(n=1, columns='winning_age'))
  • Create a Choropleth to display data on a map with plotly px.choropleth() and parameters like locations='ISO' and color_continuous_scale='matter'

  • Use plotly to display a whole bunch of lines at once (without doing a for loop like with matplotlib) with .line() and the color= parameter

  • Create Sunburst charts with plotly, e.g.

fig12 = px.sunburst(prize_orgs, path=['organization_country', 'organization_city', 'organization_name'],
                    values='nb_of_nobel_prizes', height=1200,
                    title='Nobel Prizes distributed by Research Institutions')
fig12.show()
  • Create a histogram with Seaborn and the .histplot() method, as well as giving a pre-defined style and override some of its values with rc={}, and changing the number of bars/bins with bins=, and more:
plt.figure(figsize=(8,4), dpi=200)
# using ticks style (so xticks appear) and overriding background color
with sns.axes_style('ticks', rc={'axes.facecolor': '#EAEAF2'}):
    hist = sns.histplot(x=prize_per_age.winning_age, bins=30)
    hist.set(xlabel='Age of Nobel Prize Laureates')
    hist.grid(color='white', linestyle='--', axis='y')
# plt.xticks(ticks=hist_x_ticks)
plt.show()
  • Create a regression plot with Seaborn with a moving average line using the lowess=True parameter (statsmodels package must be installed though).

  • Create box plots with Seaborn and sns.boxplot()

  • Remove legend in Seaborn or matplotlib plots with plt.legend([],[], frameon=False

  • Use Seaborn's .lmplot() and show scatter and regression/best-fit lines across multiple categories using the hue, height, aspect and lowess parameters, either in one window or in multiple with col, col_wrap and row.